NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

SpotVerse: Optimizing Bioinformatics Workflows with Multi-Region Spot Instances in Galaxy and Beyond

https://doi.org/10.1145/3652892.3700750

Son, Myungjun; Akbulut, Gulsum Gudukbay; Kandemir, Mahmut Taylan (December 2024, ACM)

Full Text Available
FAAStloop: Optimizing Loop-Based Applications for Serverless Computing

https://doi.org/10.1145/3698038.3698560

Mohanty, Shruti; Bhasi, Vivek M; Son, Myungjun; Kandemir, Mahmut Taylan; Das, Chita (November 2024, ACM)

Full Text Available
Paldia: Enabling SLO-Compliant and Cost-Effective Serverless Computing on Heterogeneous Hardware

https://doi.org/10.1109/IPDPS57955.2024.00018

Bhasi, Vivek M; Sharma, Aakash; Mohanty, Shruti; Kandemir, Mahmut Taylan; Das, Chita R (May 2024, IEEE)

Among the variety of applications (apps) being deployed on serverless platforms, apps such as Machine Learning (ML) inference serving can achieve better performance from leveraging accelerators like GPUs. Yet, major serverless providers, despite having GPU-equipped servers, do not offer GPU support for their serverless functions. Given that serverless functions are deployed on various generations of CPUs already, extending this to various (typically more expensive) GPU generations can offer providers a greater range of hardware to serve incoming requests according to the functions and request traffic. Here, providers are faced with the challenge of selecting hardware to reach a well-proportioned trade-off point between cost and performance. While recent works have attempted to address this, they often fail to do so as they overlook optimization opportunities arising from intelligently leveraging existing GPU sharing mechanisms. To address this point, we devise a heterogeneous serverless framework, PALDIA, which uses a prudent Hardware selection policy to acquire capable, costeffective hardware and perform intelligent request scheduling on it to yield high performance and cost savings. Specifically, our scheduling algorithm employs hybrid spatio-temporal GPU sharing that intelligently trades off job queueing delays and interference to allow the chosen cost-effective hardware to also be highly performant. We extensively evaluate PALDIA using 16 ML inference workloads with real-world traces on a 6 node heterogeneous cluster. Our results show that PALDIA significantly outperforms state-of-the-art works in terms of Service Level Objective (SLO) compliance (up to 13.3% more) and tail latency (up to ∼50% less), with cost savings up to 86%.
more » « less
Full Text Available
Usas: A Sustainable Continuous-Learning Framework for Edge Servers

https://doi.org/10.1109/HPCA57654.2024.00073

Mishra, Cyan Subhra; Sampson, Jack; Kandemir, Mahmut Taylan; Narayanan, Vijaykrishnan; Das, Chita R (March 2024, IEEE)

Edge servers have recently become very popular for performing localized analytics, especially on video, as they reduce data traffic and protect privacy. However, due to their resource constraints, these servers often employ compressed models, which are typically prone to data drift. Consequently, for edge servers to provide cloud-comparable quality, they must also perform continuous learning to mitigate this drift. However, at expected deployment scales, performing continuous training on every edge server is not sustainable due to their aggregate power demands on grid supply and associated sustainability footprints. To address these challenges, we propose Us.as,´ an approach combining algorithmic adjustments, hardware-software co-design, and morphable acceleration hardware to enable the training of workloads on these edge servers to be powered by renewable, but intermittent, solar power that can sustainably scale alongside data sources. Our evaluation of Us.as on a real-world´ traffic dataset indicates that our continuous learning approach simultaneously improves both accuracy and efficiency: Us.as´ offers a 4.96% greater mean accuracy than prior approaches while our morphable accelerator that adapts to solar variance can save up to {234.95kWH, 2.63MWH}/year/edge-server compared to a {DNN accelerator, data center scale GPU}, respectively.
more » « less
Full Text Available
TRIM: crossTalk-awaRe qubIt Mapping for multiprogrammed quantum systems

https://doi.org/10.1109/QSW59989.2023.00025

Khadirsharbiyani, Soheil; Sadeghi, Movahhed; Zarch, Mostafa Eghbali; Kotra, Jagadish; Kandemir, Mahmut Taylan (July 2023, IEEE International Conference on Quantum Software (QSW))

Full Text Available
An architecture interface and offload model for low-overhead, near-data, distributed accelerators

https://doi.org/10.1109/MICRO56248.2022.00083

Baskaran, Saambhavi; Kandemir, Mahmut Taylan; Sampson, Jack (October 2022, 2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO))

Full Text Available
Multi-resource fair allocation for consolidated flash-based caching systems

https://doi.org/10.1145/3528535.3565245

Choi, Wonil; Urgaonkar, Bhuvan; Kandemir, Mahmut Taylan; Kesidis, George (November 2022, ACM/IFIP Middleware)

Full Text Available
Stash: A Comprehensive Stall-Centric Characterization of Public Cloud VMs for Distributed Deep Learning

https://doi.org/10.1109/ICDCS57875.2023.00023

Sharma, Aakash; Bhasi, Vivek M; Singh, Sonali; Jain, Rishabh; Gunasekaran, Jashwant Raj; Mitra, Subrata; Kandemir, Mahmut Taylan; Kesidis, George; Das, Chita R (July 2023, IEEE)

Deep neural networks (DNNs) are increasingly popular owing to their ability to solve complex problems such as image recognition, autonomous driving, and natural language processing. Their growing complexity coupled with the use of larger volumes of training data (to achieve acceptable accuracy) has warranted the use of GPUs and other accelerators. Such accelerators are typically expensive, with users having to pay a high upfront cost to acquire them. For infrequent use, users can, instead, leverage the public cloud to mitigate the high acquisition cost. However, with the wide diversity of hardware instances (particularly GPU instances) available in public cloud, it becomes challenging for a user to make an appropriate choice from a cost/performance standpoint. In this work, we try to address this problem by (i) introducing a comprehensive distributed deep learning (DDL) profiler Stash, which determines the various execution stalls that DDL suffers from, and (ii) using Stash to extensively characterize various public cloud GPU instances by running popular DNN models on them. Specifically, it estimates two types of communication stalls, namely, interconnect and network stalls, that play a dominant role in DDL execution time. Stash is implemented on top of prior work, DS-analyzer, that computes only the CPU and disk stalls. Using our detailed stall characterization, we list the advantages and shortcomings of public cloud GPU instances for users to help them make an informed decision(s). Our characterization results indicate that the more expensive GPU instances may not be the most performant for all DNN models and that AWS can sometimes sub-optimally allocate hardware interconnect resources. Specifically, the intra-machine interconnect can introduce communication overheads of up to 90% of DNN training time and the network-connected instances can suffer from up to 5× slowdown compared to training on a single instance. Furthermore, (iii) we also model the impact of DNN macroscopic features such as the number of layers and the number of gradients on communication stalls, and finally, (iv) we briefly discuss a cost comparison with existing work.
more » « less
Full Text Available
Optimizing CPU Performance for Recommendation Systems At-Scale

https://doi.org/10.1145/3579371.3589112

Jain, Rishabh; Cheng, Scott; Kalagi, Vishwas; Sanghavi, Vrushabh; Kaul, Samvit; Arunachalam, Meena; Maeng, Kiwan; Jog, Adwait; Sivasubramaniam, Anand; Kandemir, Mahmut Taylan; et al (June 2023, International Symposium on Computer Architecture 2023)

Deep Learning Recommendation Models (DLRMs) are very popular in personalized recommendation systems and are a major contributor to the data-center AI cycles. Due to the high computational and memory bandwidth needs of DLRMs, specifically the embedding stage in DLRM inferences, both CPUs and GPUs are used for hosting such workloads. This is primarily because of the heavy irregular memory accesses in the embedding stage of computation that leads to significant stalls in the CPU pipeline. As the model and parameter sizes keep increasing with newer recommendation models, the computational dominance of the embedding stage also grows, thereby, bringing into question the suitability of CPUs for inference. In this paper, we first quantify the cause of irregular accesses and their impact on caches and observe that off-chip memory access is the main contributor to high latency. Therefore, we exploit two well-known techniques: (1) Software prefetching, to hide the memory access latency suffered by the demand loads and (2) Overlapping computation and memory accesses, to reduce CPU stalls via hyperthreading to minimize the overall execution time. We evaluate our work on a single-core and 24-core configuration with the latest recommendation models and recently released production traces. Our integrated techniques speed up the inference by up to 1.59x, and on average by 1.4x.
more » « less
An Efficient Edge-Cloud Partitioning of Random Forests for Distributed Sensor Networks

https://doi.org/10.1109/LES.2022.3207968

Shen, Tianyi; Mishra, Cyan Subhra; Sampson, Jack; Kandemir, Mahmut Taylan; Narayanan, Vijaykrishnan (October 2022, IEEE Embedded Systems Letters)

Full Text Available

« Prev Next »

Search for: All records